Research on prediction of compressive strength of fly ash and slag mixed concrete based on machine learning

Every year, large amounts of solid waste such as fly ash and slag are generated worldwide. Using these wastes as mixing materials in concrete can effectively save resources and protect the environment. The compressive strength of concrete is an essential indicator of its quality, but it is affected by many factors and is therefore difficult to predict accurately. Three popular supervised machine learning algorithms, Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Support Vector Regression (SVR), were therefore used to establish a nonlinear mapping between the multi-factor input features and the target feature, concrete compressive strength. The three trained models were validated on a test set of 206 samples, with the Root Mean Square Error (RMSE), fitting coefficient (R²), and Mean Absolute Error (MAE) as evaluation metrics. The RMSE, R², and MAE values were 0.1, 0.9, and 0.21 for the RF model; 0.05, 0.95, and 0.15 for the XGBoost model; and 0.15, 0.86, and 0.3 for the SVR model. XGBoost thus showed better generalization ability and prediction accuracy than the other two algorithms.


Introduction
Concrete is widely used in the construction industry because of its excellent performance [1]. Construction demand means that a huge amount of concrete is used worldwide every year [2]. Thus, the carbon footprint associated with the production of cement, and therefore of conventional concrete, is high. Increasing the use of supplementary cementitious materials (SCMs) in concrete is an obvious and necessary step to reduce carbon emissions [3,4]. As Mohamed and Tayeb [5] describe, concrete is a rigid material generally formed by mixing cement, sand, stone, water, and other materials in certain proportions. SCMs such as fly ash and slag are often waste materials from industrial processes. The wide application of these solid wastes in concrete has become an effective measure to reduce cement consumption and delay the release of hydration heat; it is not only economical but also gives the concrete better mixing performance, hardening properties, and durability [6][7][8]. Inclusion of SCMs in concrete lessens the environmental impact of concrete in several ways: it (1) reduces cement consumption and thereby production, (2) can reduce the amount of inert filler (typically sand in conventional concrete) required, and (3) uses waste materials that would otherwise be landfilled.
Machine learning-based models have been widely used in slope stability prediction [2], flood studies [9], prediction of the mechanical properties of materials [10][11][12], and building structures [13,14]. In addition, many researchers have explored the application of machine learning to the prediction of concrete compressive strength. For example, Yeh et al. [15][16][17] built an ANN model with a backpropagation algorithm to predict the compressive strength of concrete at different ages (3 days, 14 days, 28 days, and 90 days). Fan and Chiong et al. [18] predicted the compressive strength of concrete with a support vector machine by studying the composition of concrete. Their experimental results show that the model performs well for the reverse prediction of concrete members under both multiple-input, single-output and multiple-input, multiple-output conditions. Behnood and Golafshani [19] used decision trees to predict the compressive strength of concrete with fly ash and other waste materials; the results indicated that the proposed models could provide reliable predictions of the target mechanical properties. Deshpande and Londhe [20] estimated the compressive strength of recycled aggregate concrete with an artificial neural network (ANN); the results indicate that the ANN learned from the examples and grasped the fundamental domain rules governing the strength of concrete. At present, research on making concrete with solid wastes such as fly ash and slag as mixing materials needs further improvement [21]. Because previous studies used small sample sizes, over-fitting occurs easily during model training. The dataset used in this study contains 1030 samples, which helps mitigate the over-fitting problem. Using limited data to predict the compressive strength of this type of concrete will help realize engineering applications.
At the same time, applying the more advanced XGBoost model to the prediction of concrete compressive strength can help improve prediction accuracy and speed compared with previous models. Previous work also lacks validation against field experiment data: most studies only compared models on existing datasets without actually making concrete test blocks, which limits their practicality, whereas theoretical research should return to practical application. To address this problem, after the model performance comparison is completed, concrete test blocks are made on site. The compressive strength test is carried out after curing for a fixed time in a constant temperature curing box to verify the model's accuracy.

Research methods and steps
The mechanical properties of concrete under multi-feature coupling are affected by many factors, which overlap and influence each other and show a very complex nonlinear relationship [22]. In this paper, the prediction accuracy and practicability of three machine learning models (XGBoost, SVR, and RF) are evaluated by comparing the degree of fit between the predicted and actual values, by statistical indicators such as RMSE, R², and MAE, and by experimental verification. The programming language used was Python 3.9, and the analysis was carried out on the Notebook platform.
To better conduct model training and verification, this study follows the workflow shown in Fig 1, which mainly includes data processing, model training, model testing, model comparison, and experimental validation. The specific steps are as follows:
1. Data preprocessing. The data are first preprocessed, including quality control, dataset partitioning, and correlation analysis.
2. Machine learning model construction. Three models, RF, XGBoost, and SVR, are first built. After the initial models are obtained, they are trained with the training set samples, tested with the test set samples, and optimized through a grid search over their parameters so that model performance is optimal. Finally, the model with the optimal parameters is selected [23].
3. Comparison of model prediction performance. The test set is input into the three models to obtain the predicted values; the fitting curve, scatter chart, relative error chart, and Taylor chart between the actual and predicted values are drawn; and the Root Mean Square Error (RMSE), fitting coefficient (R²), and Mean Absolute Error (MAE) are calculated. The prediction performance of each model is then evaluated from the graphs and statistical indicators, and the model with the best comprehensive performance is obtained [24].
4. Example verification of the optimal model. Concrete test blocks with the same characteristics as the dataset are made and placed in a constant temperature curing box for 28 days; a compressive strength test is performed on the cured test blocks and the experimental data are recorded. The parameter values are input into the optimal model to predict the compressive strength, and the average error rate is calculated by comparing the experimental data with the predicted data, in order to test the performance of the model under actual working conditions [25].
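The dataset partitioning mentioned in step 1 can be sketched as follows. This is only an illustration under an assumption: an 80/20 shuffled split is inferred here from the 1030 samples and the 206 test samples reported in this paper, not stated by it. The function name `split_indices` and the seed are hypothetical.

```python
import random

def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for a reproducible split
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

# 1030 samples with an assumed 20% hold-out give 206 test samples
train_idx, test_idx = split_indices(1030)
print(len(train_idx), len(test_idx))
```

The test indices are held out before any parameter tuning so that the reported metrics reflect samples the model has not seen.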

Sample data acquisition and preprocessing
The sample data were based on the Concrete Compressive Strength dataset [26].

Correlation analysis of sample data
Before model training, sample features unrelated to concrete compressive strength should be excluded, so a correlation analysis is conducted. Correlation analysis first requires checking the distribution of the sample data; the histogram and normal curve of each feature's distribution are shown in Fig 2. There are three commonly used correlation coefficients: the Pearson, Spearman, and Kendall correlation coefficients. The Pearson correlation coefficient requires that the two variables follow a normal distribution and have a linear correlation trend; it is used to measure the linear correlation between two variables and is calculated as follows:

$$\rho_{XY} = \frac{\mathrm{Cov}(X,Y)}{\sqrt{D(X)}\,\sqrt{D(Y)}} = \frac{E\{[X - E(X)][Y - E(Y)]\}}{\sqrt{D(X)}\,\sqrt{D(Y)}} \tag{1}$$

In Eq (1): E is the mathematical expectation; D is the variance; $\sqrt{D}$ is the standard deviation; Cov(X,Y) is the covariance of the random variables X and Y, which measures how the two variables vary together; ρ_XY is the quotient of the covariance and the product of the standard deviations of the two variables, also known as the correlation coefficient between X and Y. The variables X and Y in Eq (1) refer to the two quantities being compared [29].
If two variables do not obey a normal distribution, the Spearman correlation coefficient should be used to study their relationship. The features shown in Fig 2(B), 2(C), 2(E), 2(H) and 2(I) do not obey a normal distribution, so the Spearman correlation coefficient is used as the basis for judging correlation. The Spearman correlation coefficient is calculated as follows [30,31]:

$$\rho = 1 - \frac{6\sum_{i=1}^{N} D_i^2}{N(N^2 - 1)} \tag{2}$$

In Eq (2): ρ is the Spearman correlation coefficient, which lies between -1 and 1, and the closer |ρ| is to 1, the stronger the correlation; D_i is the difference between the ranks of the i-th pair of values in the two data sequences; N is the number of data.
The correlation diagrams calculated with the Pearson and Spearman correlation coefficients are shown in Fig 3. Fig 3(A) is calculated from the Spearman correlation coefficient, and Fig 3(B) from the Pearson correlation coefficient. It can be seen that the compressive strength of concrete is positively correlated with cement, blast furnace slag, superplasticizer, and age, and negatively correlated with coarse aggregate, fine aggregate, fly ash, and water. Among them, the compressive strength of concrete is more strongly correlated with cement, water, superplasticizer, and age, and there are no
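The two coefficients can be compared on a small example. The sketch below is a plain-Python illustration, not the code used in this study: the sample values are made up, the helper names are hypothetical, and the Spearman formula is applied in its simple form, which assumes no tied values. `scipy.stats.pearsonr` and `scipy.stats.spearmanr` provide library versions of the same coefficients.

```python
import math

def pearson(x, y):
    """Pearson coefficient: Cov(X, Y) / (sqrt(D(X)) * sqrt(D(Y)))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Rank values from 1 to N (assumes no tied values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman coefficient: 1 - 6 * sum(D^2) / (N * (N^2 - 1)), no ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Made-up values: strength rises monotonically but not linearly with age
age = [1, 3, 7, 14, 28, 90]
strength = [5.0, 12.0, 20.0, 27.0, 35.0, 42.0]
print(round(pearson(age, strength), 3), spearman(age, strength))
```

Because the relationship is monotonic but nonlinear, the Spearman coefficient reaches 1 while the Pearson coefficient stays below 1, which is why the rank-based coefficient is preferred for features that are not normally distributed.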

Normalization processing
There are nine features in the selected dataset, and their dimensions differ. To improve the accuracy and stability of the model, reduce the amount of calculation, and achieve the best prediction effect, the data in the dataset must be normalized. Because the dataset is fixed and no new samples will be added, the maximum and minimum values of each feature will not change, so the normalization in this article uses the max-min mapping function, which is implemented with the MinMaxScaler class of the sklearn library in Python. The function form is as follows [32]:

$$X^{*} = \frac{x - \min}{\max - \min} \tag{3}$$

In Eq (3): X* is the mapped value; x is the value in the original dataset; min is the minimum value of each feature; max is the maximum value of each feature.
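As a minimal sketch of the max-min mapping, the function below scales one feature column to the range [0, 1]. The function name and the sample values are hypothetical; scikit-learn's `MinMaxScaler` applies the same mapping column by column.

```python
def min_max_scale(column):
    """Map each value x to (x - min) / (max - min), giving values in [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant feature: avoid division by zero
    return [(x - lo) / (hi - lo) for x in column]

cement = [102.0, 281.0, 540.0]  # hypothetical cement contents
scaled = min_max_scale(cement)
print(scaled)
```

In practice the minimum and maximum would be taken from the training set and reused for the test set, so that no information from the test samples enters the scaling.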

Extreme gradient boosting (XGBoost)
The Extreme Gradient Boosting (XGBoost) model is a tree-based model that improves on the Gradient Boosting Decision Tree (GBDT). It is an additive model composed of k base models, proposed by Tianqi Chen, a scholar at the University of Washington. The basic idea is to use each new base model to fit the residual of the previous model, continuously reducing the error of the additive model [33]. By adding a regularization term to the objective function and using a second-order derivative approximation, the XGBoost model effectively reduces overfitting and improves the computational efficiency of the algorithm. Compared with other machine learning algorithms, the XGBoost algorithm is generally faster and generalizes better. The prediction model formula is as follows:

$$\hat{y}_i = \sum_{t=1}^{k} f_t(x_i) \tag{4}$$

In Eq (4): f_t is the t-th base model; k is the number of base models; ŷ_i is the predicted value of the i-th sample.
To avoid overfitting, the regularization term on a single base model is as follows:

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} \omega_j^2 \tag{5}$$

In Eq (5): T is the number of leaf nodes of the tree; γ and λ are non-negative coefficients; ω_j is the output value of the j-th leaf node, that is, the value assigned to a sample x falling on that leaf. Therefore, the XGBoost loss function is defined as:

$$L = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{t=1}^{k} \Omega(f_t) \tag{6}$$

In Eq (6): l is the loss between the actual value y_i and the predicted value ŷ_i; n is the number of samples. The minimum of the loss function is obtained using a second-order Taylor expansion; the split point with the highest score is then searched for by exact or approximate methods, and the next step is to split and expand the leaf nodes [34].
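The split search can be illustrated with a short sketch. It relies on two standard results of the XGBoost formulation that are not written out in this paper: with G and H the sums of first and second derivatives of the loss over the samples in a leaf, the optimal leaf output is -G / (H + λ), and the gain of a split is half the increase in the sum of G² / (H + λ) over the leaves, minus γ. The code assumes the loss 0.5·(y - ŷ)², so each sample has gradient ŷ - y and second derivative 1; the function names and toy data are hypothetical.

```python
def leaf_weight(g_sum, h_sum, lam):
    """Optimal leaf output: w* = -G / (H + lambda)."""
    return -g_sum / (h_sum + lam)

def leaf_score(g_sum, h_sum, lam):
    """Structure-score term G^2 / (H + lambda) for one leaf."""
    return g_sum ** 2 / (h_sum + lam)

def best_split(x, y, y_pred, lam=1.0, gamma=0.0):
    """Exact greedy search for the best threshold on a single feature.

    Assumes the loss 0.5 * (y - y_pred)^2, so g = y_pred - y and h = 1.
    Returns (gain, threshold), or (0.0, None) if no split has positive gain.
    """
    rows = sorted(zip(x, (p - t for t, p in zip(y, y_pred))))
    g_total, h_total = sum(g for _, g in rows), float(len(rows))
    best_gain, best_thr = 0.0, None
    g_left = h_left = 0.0
    for i in range(len(rows) - 1):
        g_left += rows[i][1]
        h_left += 1.0
        if rows[i][0] == rows[i + 1][0]:
            continue  # cannot split between equal feature values
        gain = 0.5 * (leaf_score(g_left, h_left, lam)
                      + leaf_score(g_total - g_left, h_total - h_left, lam)
                      - leaf_score(g_total, h_total, lam)) - gamma
        if gain > best_gain:
            best_gain, best_thr = gain, (rows[i][0] + rows[i + 1][0]) / 2
    return best_gain, best_thr

# Toy data: the target jumps between x = 3 and x = 10
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
gain, thr = best_split(x, y, y_pred=[0.0] * 6)
print(gain, thr)
```

Larger values of λ shrink the leaf outputs toward zero and larger values of γ reject low-gain splits, which is how the regularization term limits tree complexity.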

Random Forest (RF)
The Random Forest (RF) model is based on the decision tree model under the bagging framework. Bagging with decision trees as the base model generates one decision tree from each bootstrap sample (sampling with replacement), with no further intervention in growing the trees. Random forest also performs bootstrap sampling, but it differs from bagging in that, when each tree is generated, the variable at each node is chosen from only a few randomly selected variables. Therefore, not only the samples but also the choice of each node variable is random. The advantage is that a limited number of weak learners can be combined linearly into a strong learner, improving accuracy and robustness [35]. The problem studied in this paper is a regression problem, so the random forest regression algorithm takes the outputs of all decision trees and averages them. The regression equation is as follows:

$$\bar{H}(x) = \frac{1}{K}\sum_{k=1}^{K} h(x; \theta_k) \tag{7}$$

In Eq (7): $\bar{H}(x)$ is the prediction result; h(x; θ_k) is a single decision tree; θ_k is an independent, identically distributed random variable that determines the growth process of a single decision tree; K is the number of decision trees.
Each decision tree in a random forest does not see all features simultaneously, so each tree carries some randomness, which increases the model's generalization ability.
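The two sources of randomness and the averaging step can be sketched in plain Python. This is a simplified illustration rather than the implementation used in this study: each tree is reduced to a single split (a stump), so the random choice of candidate variables at each node happens once per tree, and all names and data are hypothetical.

```python
import random

def fit_stump(X, y, feature_ids):
    """Fit a one-split regression tree using only the candidate features."""
    mean_y = sum(y) / len(y)
    best = (float("inf"), None, None, mean_y, mean_y)  # sse, feature, thr, left, right
    for f in feature_ids:
        for thr in sorted({row[f] for row in X})[:-1]:
            left = [t for row, t in zip(X, y) if row[f] <= thr]
            right = [t for row, t in zip(X, y) if row[f] > thr]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((t - ml) ** 2 for t in left) + sum((t - mr) ** 2 for t in right)
            if sse < best[0]:
                best = (sse, f, thr, ml, mr)
    return best[1:]

def predict_stump(stump, row):
    f, thr, left_value, right_value = stump
    if f is None:  # no valid split was found: predict the sample mean
        return left_value
    return left_value if row[f] <= thr else right_value

def fit_forest(X, y, n_trees=25, n_candidate_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
        feats = rng.sample(range(len(X[0])), n_candidate_features)  # random variables
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Average the outputs of all K trees."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)

# Toy data: feature 0 drives the target, feature 1 is noise
X = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [10.0, 1.0], [11.0, 0.0], [12.0, 1.0]]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
forest = fit_forest(X, y)
print(predict_forest(forest, [2.0, 0.0]), predict_forest(forest, [11.0, 0.0]))
```

Individual stumps that only see the noise feature predict poorly, but averaging over many randomized trees still separates the low and high samples.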

Support Vector Regression (SVR)
The Support Vector Machine (SVM) model is a data mining model based on statistical learning theory, whose purpose is to find an optimal hyperplane separating two different classes. When the SVM is used for regression problems, the model is called Support Vector Regression (SVR). For nonlinear problems, a kernel function maps the problem from a low-dimensional space to a high-dimensional feature space, where the optimal hyperplane is sought such that the distance between the training set samples and this plane is minimized, thereby achieving regression prediction [36,37]. In this paper, after verification with multiple parameters, the RBF (Radial Basis Function) kernel is used, expressed as follows:

$$K(X, X_i) = \exp\left(-\frac{\lVert X - X_i \rVert^2}{2\sigma^2}\right) \tag{8}$$

In Eq (8): σ is the width parameter of the RBF; exp is the exponential function with the natural constant e as the base; ‖X − X_i‖ is the distance between the sample X and the selected center point X_i.
The decision function in the high-dimensional feature space is:

$$f(x) = W^{T}\varphi(x) + b \tag{9}$$

In Eq (9): W is an adjustable weight vector; φ(x) is a nonlinear mapping function; b is a constant.
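Ignoring fitting errors smaller than ε, the SVR model can be transformed into the following constrained optimization problem:

$$\begin{cases} \min \ \dfrac{1}{2}\lVert W \rVert^2 + c\sum_{i=1}^{n}\left(\xi_i + \xi_i^{*}\right) \\ \text{s.t.} \ \ y_i - W^{T}\varphi(x_i) - b \le \varepsilon + \xi_i \\ \qquad W^{T}\varphi(x_i) + b - y_i \le \varepsilon + \xi_i^{*} \\ \qquad \xi_i \ge 0,\ \xi_i^{*} \ge 0,\quad i = 1, \dots, n \end{cases} \tag{10}$$

In Eq (10): c is the penalty factor; ξ_i and ξ_i* are a pair of relaxation factors; n is the number of training samples. By introducing the Lagrange function, the optimization problem of Eq (10) is transformed into its dual form, and the SVR objective equation obtained by solving it is shown in Eq (11):

$$f(x) = \sum_{i=1}^{n}\left(a_i - a_i^{*}\right)K(X_i, X) + b \tag{11}$$

In Eq (11): a_i and a_i* are the dual variables associated with the first and second inequality constraints of Eq (10), respectively; K(X_i, X_j) = φ(x_i)^T φ(x_j).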
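The building blocks of the SVR model can be sketched as follows: the RBF kernel with width σ, the ε-insensitive loss that ignores fitting errors smaller than ε, and a prediction written as a kernel-weighted sum over support vectors plus the constant b. The function names and numbers are hypothetical, and the dual coefficients are supplied by hand here rather than obtained by solving the optimization problem. In scikit-learn's `SVR`, the RBF kernel is parameterized by `gamma`, which corresponds to 1 / (2σ²) in this notation.

```python
import math

def rbf_kernel(x, xi, sigma):
    """RBF kernel: exp(-||x - xi||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def eps_insensitive_loss(y_true, y_pred, eps):
    """Errors smaller than eps cost nothing; larger ones grow linearly."""
    return max(0.0, abs(y_true - y_pred) - eps)

def svr_predict(x, support_vectors, dual_coefs, b, sigma):
    """Dual-form prediction: sum_i coef_i * K(x_i, x) + b."""
    return sum(c * rbf_kernel(sv, x, sigma)
               for sv, c in zip(support_vectors, dual_coefs)) + b

# Hand-picked values for illustration only
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
dual_coefs = [0.5, -0.25]  # each entry plays the role of one dual-variable difference
print(svr_predict([0.0, 0.0], support_vectors, dual_coefs, b=0.1, sigma=1.0))
```

A smaller σ makes the kernel decay faster with distance, so each support vector influences only nearby samples and the fitted function becomes more flexible.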

Model parameter selection
The optimal parameters of the three models are found by grid search [38]. Grid search is a model parameter optimization technique that performs an exhaustive search over specified parameter values. First, the Cartesian product of the given parameter values is taken to obtain a finite set of parameter combinations. Each combination is then used to train the model, and the combination with the best performance is selected as the optimal parameters. The models are written in Python, and the grid search uses the GridSearchCV function of the sklearn library. The optimal parameters of the three models are shown in Table 3.
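The procedure described above can be sketched in a few lines. This illustration replaces model training with a hypothetical scoring function, and the parameter names and candidate values are made up rather than taken from Table 3. `GridSearchCV` follows the same exhaustive pattern but scores each combination by cross-validation.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every combination in the Cartesian product of parameter values."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)  # higher is better
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    """Hypothetical stand-in for the validated performance of a trained model."""
    return -((params["max_depth"] - 6) ** 2) - 100 * (params["learning_rate"] - 0.1) ** 2

grid = {"max_depth": [3, 6, 9], "learning_rate": [0.01, 0.1, 0.3]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params)
```

The cost grows with the product of the number of candidate values per parameter (nine combinations here), so the grid is usually kept coarse.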

Model prediction results and performance analysis
The trained models are applied to the verification set, and the fitted curves of the actual and predicted values are obtained. To evaluate the comprehensive performance of the models more objectively, and because the problem in this study is a regression problem, the Root Mean Square Error (RMSE), fitting coefficient (R²), and Mean Absolute Error (MAE) are adopted as evaluation indicators. The RMSE measures the deviation between the predicted and actual values; the smaller the RMSE, the higher the accuracy. The fitting coefficient R² measures the degree of fit between the predicted and actual values; the closer it is to 1, the better the fit. The MAE is the mean of the absolute differences between the actual and predicted values; the closer it is to 0, the better the model performance. The evaluation indices of the three models are shown in Table 4. The XGBoost model has the highest R² value and the lowest RMSE and MAE values, so it performs best. The RF model has the second highest R² value and the second lowest RMSE and MAE values, ranking below the XGBoost model, and the SVR model performs worst. The formulas of the three indicators are as follows [39]:

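$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(b_{ti} - b_{pi}\right)^2} \tag{12}$$

$$R^2 = \frac{\left[\sum_{i=1}^{n}\left(b_{ti} - \bar{b}_t\right)\left(b_{pi} - \bar{b}_p\right)\right]^2}{\sum_{i=1}^{n}\left(b_{ti} - \bar{b}_t\right)^2 \sum_{i=1}^{n}\left(b_{pi} - \bar{b}_p\right)^2} \tag{13}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|b_{ti} - b_{pi}\right| \tag{14}$$

In Eqs (12)-(14): n is the number of data; b_ti is the actual value of the i-th sample; b_pi is the predicted value of the i-th sample; $\bar{b}_t$ and $\bar{b}_p$ are the averages of the actual and predicted results, respectively.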
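The three indicators can be computed directly from paired actual and predicted values. The sketch below uses made-up numbers and hypothetical function names. It assumes that R² is the squared correlation between actual and predicted values, as suggested by the use of both averages in the definitions; note that scikit-learn's `r2_score` instead computes one minus the ratio of the residual sum of squares to the total sum of squares, which can give a different value.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mae(actual, predicted):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r2_correlation(actual, predicted):
    """Squared Pearson correlation between actual and predicted values."""
    n = len(actual)
    ma, mp = sum(actual) / n, sum(predicted) / n
    cross = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    ss_a = sum((a - ma) ** 2 for a in actual)
    ss_p = sum((p - mp) ** 2 for p in predicted)
    return cross ** 2 / (ss_a * ss_p)

actual = [30.0, 40.0, 50.0]      # made-up strengths
predicted = [32.0, 38.0, 50.0]   # made-up model outputs
print(rmse(actual, predicted), r2_correlation(actual, predicted), mae(actual, predicted))
```

RMSE penalizes large errors more heavily than MAE because the differences are squared before averaging, so the two are best read together.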
The applied predictive models were also assessed through correlation analysis. The scatter plot is one of the most informative graphical presentations for examining the aptitude of prediction models [40,41]. The results are shown in Fig 5, where Fig 5(A), Fig 5(B), and Fig 5(C) show the scatter plots of the prediction results of the XGBoost, RF, and SVR models, respectively. Comparing the three figures shows that the XGBoost model captures the relationship between input and output well after training, and that the correlation between its predictions and the experimental results is closer to 1 than that of the RF and SVR models.
In the testing phase, a total of 206 samples were tested, and the relative error (RE) percentage was calculated for each sample. The RE value makes the stability of the prediction model and the error extremes directly visible, as shown in Fig 6, where Fig 6(A), Fig 6(B), and Fig 6(C) give the relative error percentages of the XGBoost, RF, and SVR models, respectively. The XGBoost model is more stable, with the absolute value of the error percentage within 10%, while that of the RF and SVR models is within 27%. In terms of error extremes, the absolute value for the XGBoost model is 8%, while the RF model reaches 26% and the SVR model reaches 27%. Based on the RE% evaluation, the XGBoost model is therefore more stable in prediction performance [42]. Fig 7 summarizes the general performance of the applied models in the form of a Taylor diagram [43]. This diagram expresses three main statistics: the correlation coefficient between the predicted values and the measured data (as an angle in the polar plot), the RMSE (as a radial distance from the observation point), and the ratio of the standard deviation of the predicted values (as a radial distance from the origin). As can be observed in Fig 7, the XGBoost model is closer to the REF point than the other two predictive modelling approaches (RF and SVR).
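The statistics summarized by a Taylor diagram can be computed as in the sketch below, which uses made-up values and a hypothetical function name. In the standard construction of the diagram, the distance from the reference point is the centered root mean square difference (computed after removing the means), and it is tied to the two standard deviations and the correlation coefficient by the law of cosines, which is what allows all three statistics to be shown in one plot.

```python
import math

def taylor_statistics(observed, predicted):
    """Return the correlation, standard deviations, and centered RMS difference."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    std_obs = math.sqrt(sum((o - mo) ** 2 for o in observed) / n)
    std_pred = math.sqrt(sum((p - mp) ** 2 for p in predicted) / n)
    corr = sum((o - mo) * (p - mp)
               for o, p in zip(observed, predicted)) / (n * std_obs * std_pred)
    crmsd = math.sqrt(sum(((p - mp) - (o - mo)) ** 2
                          for o, p in zip(observed, predicted)) / n)
    return {"corr": corr, "std_obs": std_obs, "std_pred": std_pred, "crmsd": crmsd}

observed = [30.0, 40.0, 50.0, 60.0]    # made-up measured strengths
predicted = [33.0, 38.0, 52.0, 57.0]   # made-up model outputs
stats = taylor_statistics(observed, predicted)

# Law of cosines: crmsd^2 = std_obs^2 + std_pred^2 - 2 * std_obs * std_pred * corr
rhs = (stats["std_obs"] ** 2 + stats["std_pred"] ** 2
       - 2 * stats["std_obs"] * stats["std_pred"] * stats["corr"])
print(stats["crmsd"] ** 2, rhs)
```

A model plotted close to the reference point therefore combines a high correlation with a standard deviation close to that of the measurements.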

Experimental verification
In the laboratory, 100 × 100 × 100 mm concrete test blocks with the same characteristics as the dataset were prepared. The test blocks were placed in a constant temperature curing box and cured for 28 days at 18 degrees Celsius and 95% humidity, as shown in Fig 8. To guard against accidents such as damage, and to increase the number of samples and reduce the influence of chance, 10 mix proportions were prepared, with 3 test blocks made for each, for a total of 30 test blocks. For each mix proportion, the compressive strength test was carried out on the test block with the best appearance among the three; the concrete mix ratio, predicted strength, and experimental strength are shown in Table 5. Finally, the average error rate over the 10 mix proportions was calculated to be 3.5%, which is low, so the model's accuracy is in line with actual engineering needs [44]. The formula for the average error rate is as follows:

$$e = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|x_i^{*} - x_i\right|}{x_i^{*}} \times 100\% \tag{15}$$

In Eq (15): e is the average error rate; x* is the actual value; x is the predicted value; n is the total number of samples.
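As a small worked example, the average error rate can be computed as below, assuming it is the mean of the absolute differences between actual and predicted strength, each divided by the actual strength and expressed in percent. The values are hypothetical and are not those of Table 5.

```python
def average_error_rate(actual, predicted):
    """Mean of |actual - predicted| / actual, expressed in percent."""
    total = sum(abs(a - p) / a for a, p in zip(actual, predicted))
    return 100 * total / len(actual)

measured = [40.0, 50.0]    # hypothetical measured strengths
estimated = [39.0, 52.0]   # hypothetical model predictions
rate = average_error_rate(measured, estimated)
print(rate)
```

Here the two relative errors are 2.5% and 4%, so the average error rate is 3.25%.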

Conclusion
1. After the initial training of the three machine learning models, the feature data were input into the models to predict the compressive strength of concrete. The fitting curves of the actual and predicted values of the three models were then compared, together with performance indicators such as R², RMSE, and MAE. The results show that XGBoost performs better in predicting concrete compressive strength: its R², RMSE, and MAE are 0.95, 0.05, and 0.15, respectively. Its R² is closer to 1 and its RMSE and MAE are closer to 0, so its error is smaller and it is more effective than the SVR and RF models.
2. In the laboratory, fly ash and ground slag were added to concrete test blocks as admixtures, a hydraulic universal testing machine was used to measure their compressive strength, and the trained XGBoost model was used to predict it. The results show that the XGBoost model predicts concrete compressive strength well, is easy to build, and computes quickly. The actual average error rate is only 3.5%, which provides an effective method for predicting in advance the compressive strength of concrete with a known mix ratio.
3. Concrete made with the traditional solid wastes fly ash and blast furnace slag as admixtures has high compressive strength and can be used stably for a long time, which meets engineering strength requirements. This has guiding significance for engineering applications and realizes the goal of recycling solid waste such as fly ash and slag.
4. With a machine learning model, the compressive strength of concrete made with fly ash and blast furnace slag as mixing materials can be obtained through fewer experiments. Previous approaches required concrete samples to be prepared in large quantities to test their compressive strength, so predicting compressive strength with a machine learning model saves experimental costs and resources.

Limitation and future research direction
1. This paper investigates the compressive strength of concrete with fly ash and slag as admixture materials. Other mechanical properties, such as tensile and flexural strength, are not investigated. They need to be explored further to establish a complete predictive model for the mechanics of concrete containing solid waste materials such as fly ash and slag.
2. In this paper, the compressive strength of concrete was investigated using the ratio of eight different materials as the input parameters of the prediction model. However, many other factors affect the compressive strength of concrete with solid waste materials such as fly ash and slag, and these factors constrain one another. Subsequent studies can therefore consider as many different factors as possible as input parameters, such as aggregate particle size and the proportion of each component of the solid waste materials.